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Abstract 

The molecular dynamics simulation package GROMACS runs efficiently on a wide vari¬ 
ety of hardware from commodity workstations to high performance computing clusters. Hard¬ 
ware features are well exploited with a combination of SIMD, multi-threading, and MPI-based 
SPMD/MPMD parallelism, while GPUs can be used as accelerators to compute interactions 
offloaded from the CPU. Here we evaluate which hardware produces trajectories with GRO¬ 
MACS 4.6 or 5.0 in the most economical way. We have assembled and benchmarked com¬ 
pute nodes with various CPU / GPU combinations to identify optimal compositions in terms of 
raw trajectory production rate, performance-to-price ratio, energy efficiency, and several other 
criteria. Though hardware prices are naturally subject to trends and fluctuations, general ten¬ 
dencies are clearly visible. Adding any type of GPU significantly boosts a node’s simulation 
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performance. For inexpensive consumer-class GPUs this improvement equally reflects in the 
performance-to-price ratio. Although memory issues in consumer-class GPUs could pass un¬ 
noticed since these cards do not support EGG memory, unreliable GPUs can be sorted out with 
memory checking tools. Apart from the obvious determinants for cost-efficiency like hard¬ 
ware expenses and raw performance, the energy consumption of a node is a major cost factor. 
Over the typical hardware lifetime until replacement of a few years, the costs for electrical 
power and cooling can become larger than the costs of the hardware itself. Taking that into 
account, nodes with a well-balanced ratio of CPU and consumer-class GPU resources produce 
the maximum amount of GROMACS trajectory over their lifetime. 


1 Introduction 

Many research groups in the field of molecular dynamics (MD) simulation and also computing 
centers need to make decisions on how to set up their compute clusters for running the MD codes. 
A rich variety of MD simulation codes is available, among them CHARMM,® Amber,Desmond, 
LAMMPS,E1 aCEMD,SnAMD, island GROMACSl^El. Here we focus on GROMACS, which is 
among the fastest ones, and provide a comprehensive test intended to identify optimal hardware in 
terms of MD trajectory production per investment. 

One of the main benefits of GROMACS is its bottom-up performance-oriented design aimed 
at highly efficient use of the underlying hardware. Hand-tuned compute kernels allow utilizing 
the single instruction multiple data (SIMD) vector units of most consumer and HPC processor 
platforms, while OpenMP multi-threading and GROMACS’ built-in thread-MPI library together 
with non-uniform memory access (NUMA)-aware optimizations allow for efficient intra-node par¬ 
allelism. By employing a neutral-territory domain decomposition implemented with MPI, a sim¬ 
ulation can be distributed across multiple nodes of a cluster. Beginning with version 4.6, the 
compute-intensive calculation of short-range non-bonded forces can be offloaded to graphics pro¬ 
cessing units (GPUs), while the CPU concurrently computes all remaining forces such as long- 
range electrostatics, bonds, etc., and updates the particle positions.!^ Additionally, through multiple 
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program multiple data (MPMD) task-decomposition the long-range electrostatics calculation can 
be offloaded to a separate set of MPI ranks for better parallel performance. This multi-level het¬ 
erogeneous parallelization has been shown to achieve strong scaling to as little as 100 particles 
per core, at the same time reaching high absolute application performance on a wide range of 
homogeneous and heterogeneous hardware platforms 

A lot of effort has been invested over the years in software optimization, resulting in GRO- 
MACS being one of the fastest MD software engines available todayGROMACS runs on a 
wide range of hardware, but some node configurations produce trajectories more economically 
than others. In this study we ask: What is the ‘optimal’ hardware to run GROMACS on and how 
can optimal performance be obtained? 

Using a set of representative biomolecular systems we determine the simulation performance 
for various hardware combinations, with and without GPU acceleration. For each configuration 
we aim to determine the run parameters with the highest performance at comparable numerical 
accuracy. Therefore, this study also serves as a reference on what performance to expect for a 
given hardware. Additionally, we provide the GROMACS input files for own benchmarks and the 
settings that gave optimum performance for each of the tested node types. 

Depending on the projects at hand, every researcher will have a somewhat different definition 
of ‘optimal,’ but one or more of the following criteria Cl -C5 will typically be involved: 

Cl - the performance-to-price ratio, 

C2 - the achievable single-node performance, 

C3 - the parallel performance or the ‘time-to-solution,’ 

C4 - the energy consumption or the ‘energy-to-solution,’ 

C5 - rack space requirements. 

If on a fixed total budget for hardware, electricity, and cooling, the key task is to choose the 
hardware that produces the largest amount of MD trajectory for the investment. 
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Here we focus on the most suitable hardware for GROM ACS MD simulations. Due to the 


domain-specific requirements of biomolecular MD and in particular that of algorithms and im¬ 
plementation employed by GROM ACS, such hardware will likely not be the best choice for a 
general-purpose cluster that is intended to serve a broad range of applications. At the same time, 
it is often possible to pick a middle-ground that provides good performance both for GROMACS 
and other applications. 

In the next section we will describe the key determinants for GROMACS performance, and 
how GROMACS settings can be tuned for optimum performance on any given hardware. Using 
two prototypic MD systems, we will then systematically derive the settings yielding optimal per¬ 
formance for various hardware configurations. For some representative hardware setups we will 
measure the power consumption to estimate the total MD trajectory production costs including 
electrical power and cooling. Finally, for situations where simulation speed is crucial, we will look 
at highly parallel simulations for several node types in a cluster setting. 

2 Key determinants for GROMACS performance 

GROMACS automatically detects a node’s hardware resources such as CPU cores, hardware thread 
support, and compatible GPUs, at run time. The main simulation tool, mdrun, makes an educated 
guess on how to best distribute the computational work onto the available resources. When exe¬ 
cuted on a single node using its integrated, low-overhead thread-MPI library, built-in heuristics can 
determine essentially all launch configurations automatically, including number of threads, ranks, 
and GPU to rank assignment, allowing to omit some or all of these options. We use the term ‘rank’ 
for both MPI processes and thread-MPI ranks here; both have the same functionality, whereas 
thread-MPI ranks can only be used within the same node. Additionally we use the term ‘threads’ 
or ‘threading’ to refer to OpenMP threads; each rank may thus comprise a group of threads, mdrun 
optimizes the thread layout for data locality and reuse also managing its own thread affinity set¬ 
tings. Default settings typically result in a fairly good simulation performance, and especially in 
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single-node runs and on nodes with a single CPU and GPU often optimal performanee is reaehed 
without optimizing settings manually. However, tuning a standard simulation setup with partiele- 
mesh Ewald^ (PME) eleetrostaties for optimum performanee on a eompute node with multiple 
CPUs and GPUs or on a eluster of sueh nodes typieally requires optimization of simulation and 
launeh parameters. To do this it is important to understand the underlying load distribution and 
balaneing meehanisms.^The eontrol parameters of these allow optimizing for simulation speed, 
without eompromising numerieal aeeuraey. 

Load distribution and balancing mechanisms 

GROMACS uses domain deeomposition (DD) to split up the simulation system into Add = DD^ x 
DDy X DD^ initially equally-sized domains and eaeh of these is assigned to an MPI rank. If 
dynamie load balaneing (DEB) is aetive, the sizes of the DD eells are eontinuously adjusted during 
the simulation to balanee any uneven eomputational load between the domains. 

In simulations using PME, MPMD parallelization allows dedieating a group of Apme ranks to 
the ealeulation of the long-range (reeiproeal spaee) part of the Coulomb interaetions, while the 
short-range (direet spaee) part is eomputed on the remaining Add ranks. A partiele-mesh eval¬ 
uation is also supported for the long-range eomponent of the Eennard-Jones potential with the 
Eennard-Jones PME (EJ-PME) implementation available sinee the 5.0 release.^^^^ The eoarse 
task-deeomposition based on MPMD allows redueing the number of ranks involved in the eostly 
all-to-all eommunieation during 3D-PPT needed by the PME eomputation, whieh greatly reduees 
the eommunieation overheadEor a large number of ranks Ar ank S> 8, peak performanee is 
therefore usually reaehed with an appropriate separation Ara nk = Add + A^pme- The number Apme 
of separate PME ranks ean be eonveniently determined with the g_tune_pme toolQ whieh is dis¬ 
tributed with GROMACS sinee version 4.5. 

When a supported GPU is deteeted, the short-range part of Coulomb and van der Waals inter¬ 
aetions are automatieally offloaded, while the long-range part, as needed for PME or EJ-PME, as 
'From version 5.0 the tune_pnie command of the gmx tool. 
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well as bonded interaetions are eomputed on the CPU. For the PME computation, a fine PME grid 
in combination with a short Coulomb cutoff results in a numerical accuracy comparable to that 
of a coarse grid with a large cutoff. Therefore, by increasing short-range interaction cutoff while 
also increasing the PME grid spacing, GROMACS can gradually shift computational load between 
particle-particle (PP) and PME computation when the two are executed on different resources. This 
is implemented in form of an automated static load-balancing between CPU and GPU or between 
PP and PME ranks, and it is carried out during the initial few hundreds to thousands of simulation 
steps. 

By default, the GROMACS heterogeneous parallelization uses one GPU per DD cell, mapping 
each accelerator to a PP rank. If explicit threading parameters are omitted, it also automatically 
distributes the available CPU cores among ranks within a node by spawning the correct number 
of threads per rank. Both thread count and order takes into account multiple hardware threads 
per core with Hyper-Threading (HT). Using fewer and larger domains with GPU acceleration 
allows reducing overhead associated to GPU offloading like CUDA runtime as well as kernel 
startup and tail overheads]^ On the other hand, since the minimum domain size is limited by cutoff 
and constraint restrictions, using larger domains also ensures that both small systems and systems 
with long-range constraints can be simulated using many GPUs. Often however, performance 
is improved by using multiple domains per GPU. In particular, with more CPUs (or NUMA 
regions) than GPUs per node and also with large-core count processors, it is beneficial to reduce 
the thread count per rank by assigning multiple, ‘narrower’ ranks to a single GPU. This reduces 
multi-threading parallelization overheads, and by localizing domain data reduces cache coherency 
overhead and inter-socket communication. Additional benefits come from multiple ranks sharing 
a GPU as both compute kernels and transfers dispatched from each rank using the same GPU can 
overlap in newer CUDA versions. 

CPU-GPU and domain-decomposition load balancing are triggered simultaneously at the be¬ 
ginning of the run, which can lead to unwanted interaction between the two. This can have a 

^Kernel ‘tail’ is the final, typically imbalanced part of a kernel execution across multiple compute units (of a GPU), 
where some units already ran out of work while others are still active. 
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detrimental effect on the performance in cases where DD cells are, or as a result of DLB become 
close in size to the cutoff in any dimension. In such cases, especially with pronounced DD load 
imbalance, DLB will quickly scale domains in an attempt to remove imbalance reducing the do¬ 
main sizes in either of the x, y, or z dimensions to a value close to the original buffered cutoff. This 
will limit the CPU-GPU load-balancing in its ability to scale the cutoff, often preventing it from 
shifting more work off of the CPU and leaving the GPUs under-utilized. Ongoing work aims to 
eliminate the detrimental effect of this load balancer interplay with a solution planned for the next 
GROMACS release. 

Another scenario, not specific to GPU acceleration, is where DLB may indirectly reduce per¬ 
formance by enforcing decomposition in an additional dimension. With DLB enabled the DD 
needs to account for domain resizing when deciding on the number of dimensions required by the 
DD grid. Without DLB the same number of domains may be obtained by decomposing in fewer 
dimensions. Although decomposition in all three dimensions is generally possible, it is desirable 
to limit the number of dimensions in order to reduce the volumes communicated. In such cases, it 
can be faster to switch off DLB, to fully benefit from GPU offloading. 

Making optimal use of GPUs 

In addition to the load distribution and balancing mechanisms directly controlled by GROMACS, 
with certain GPU boards additional performance tweaks may be exploited. NVIDIA Tesla cards 
starting with the GKllO micro-architecture as well as some Quadro cards support a so-called 
‘application clock’ setting. This feature allows using a custom GPU clock frequency either higher 
or lower than the default value. Typically, this is used as a manual frequency boost to trade available 
thermal headroom for improved performance, but it can also be used to save power when lower 
GPU performance is acceptable. In contrast, consumer GPUs do not support application clocks 
but instead employ an automated clock scaling (between the base and boost clocks published as 
part of the specs). This can not be directly controlled by the user. 

A substantial thermal headroom can be available with compute applications because parts of 
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Figure 1: Thermal throttling of the GPU clock frequency on a GeForce GTX TITAN. Starting 
from a cool, idle state at time t = 0, at about T = 36 °C, the GPU is put under normal GROMACS 
load. The clock frequency is first scaled to 1,006 MHz, but with the temperature quickly increasing 
due to the fan speed being capped at the default 60%, the GPU quickly reaches T = 80 °C, starts 
throttling, gradually slowing down to 941 MHz. 


the GPU board are frequently left under- or unutilized. Graphics-oriented functional units, part 
of the on-board GDDR memory, and even the arithmetic units may idle, especially in case of 
applications relying on heterogeneous parallelization. In GROMACS the CPU-GPU concurrent 
execution is possible only during force computation, and the GPU is idle most of the time outside 
this region, typically for 15-40% of a time step. This leaves enough thermal headroom to allow 
setting the highest application clock on all GPUs to date (see Figure|^on p. 24). 

Increasing the GPU core clock rate yields a proportional increase in non-bonded kernel per¬ 
formance. This will generally translate into improved GROMACS performance, but its magnitude 
depends on how GPU-bound the specific simulation is. The expected performance gain is highest 
in strongly GPU-bound cases (where the CPU waits for results from GPU). Here, the reduction 
in GPU kernel time translates into reduction in CPU wait time hence improved application per¬ 
formance. In balanced or CPU-bound cases the effective performance gain will often be smaller 
and will depend on how well can the CPU-GPU load-balancing make use of the increased GPU 
performance. Note that there is no risk involved in using application clocks; even if a certain 
workload could generate high enough GPU load for the chip to reach its temperature or power 
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limit, automated frequency throttling will ensure that the limits will not be crossed. The upcoming 
GROMACS version 5.1 will have built-in support for checking and setting the application clock of 
compute cards at runtime. 

Indeed, frequency throttling is more common in case of the consumer boards, and factory- 
overclocked parts can be especially prone to overheating. Even standard clocked desktop-oriented 
GeForce and Quadro cards come with certain disadvantages for compute use. Being optimized 
for acoustics, desktop GPUs have their fan limited to approximately 60% of the maximum rotation 
speed. As a result, frequency throttling will occur as soon as the GPU reaches its temperature limit, 
while the fan is kept at < 60%. As illustrated on Figure a GeForce GTX TITAN board installed 
in a well-cooled rack-mounted chassis under normal GROMACS workload starts throttling already 
after a couple of minutes, successively dropping its clock speed by a total of 7% in this case. This 
behavior is not uncommon and can cause load-balancing issues and application slowdown as large 
as the GPU slowdown itself. The supporting information shows how to force the GPU fan speed 
to higher value. 

Another feature, available only with Tesla cards, is the CUDA multi-process server (MPS) 
which provides two possible performance benefits. The direct benefit is that it enables the overlap 
of tasks (both kernels and transfers) issued from different MPI ranks to the same GPU. As a result, 
the aggregate time to complete all tasks the assigned ranks will decrease. E. g. in a parallel run 
with 2000 atoms per MPI rank, using 6 ranks and a Tesla K40 GPU with CUDA MPS enabled 
we measured 30% reduction in the total GPU execution time compared to running without MPS. 
A secondary, indirect benefit is that in some cases the CPU-side overhead of the CUDA runtime 
can be greatly reduced when, instead of the pthreads-based thread-MPI, MPI processes are used 
(in conjunction with CUDA MPS to allow task overlap). Although CUDA MPS is not completely 
overhead-free, at high iteration rates of < 1 ms/step quite common for GROMACS, the task launch 
latency of the CUDA runtime causes up to 10 — 30% overhead, but this can decreased substantially 
with MPI and MPS. In our previous example using 6-way GPU sharing the measured CUDA 
runtime overhead was reduced from 16% to 4%. 


9 


3 Methods 


We will start this section by outlining the MD systems used for performance evaluation. We will 
give details about the employed hardware and about the software environment in which the tests 
were done. Then we will describe our benchmarking approach. 


Benchmark input systems 


Table 1: Specifications of the two MD systems used for benchmarking. 


MD system 

membrane protein (MEM) 

ribosome (RIB) 

symbol used in plots 

• 

★ 

# particles 

81,743 

2,136,412 

system size (nm) 

10.8x10.2x9.6 

31.2x31.2x31.2 

time step length (fs) 

2 

4 

cutoff radii^ (nm) 

1.0 

1.0 

PME grid spacing^ (nm) 

0.120 

0.135 

neighborlist update freq. CPU 

10 

25 

neighborlist update freq. GPU 

40 

40 

load balancing time steps 

5,000 - 10,000 

1,000-5,000 

benchmark time steps 

5,000 

1,000-5,000 


"Table lists the initial values of Coulomb cutoff and PME grid spacing. These are adjusted for optimal load balance 
at the beginning of a simulation. 


We used two representative biomolecular systems for benchmarking as summarized in Table 
The MEM system is a membrane channel protein embedded in a lipid bilayer surrounded by water 
and ions. With its size of ~ 80 k atoms it serves as a prototypic example for a large class of setups 
used to study all kinds of membrane-embedded proteins. RIB is a bacterial ribosome in water 
with ions^ and with more than two million atoms an example of a rather large MD system that is 
usually run in parallel across several nodes. 


Software environment 

The benchmarks have been carried out with the most recent version of GROMACS 4.6 available 
at the time of testing (see 5th column of Table |^. Results obtained with version 4.6 will in the 
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Table 2: Overview of the tested node types with the used software staek. These nodes were eom- 
bined with various GPUs from Table 


Hardware per node 




Software stack (versions) 


Processor(s) 

total 

cores 

RAM 

(GB) 

IB net¬ 
work 

GRO¬ 

MACS 

GCC 

MPF 

library CUDA 

Intel Core i7-4770K 

4 

8 

— 

4.6.7 

4.8.3 


6.0 

Intel Core i7-5820K 

6 

16 

- 

4.6.7 

4.8.3 

- 

6.0 

Intel Xeon E3-1270v2 

4 

16 

QDR 

4.6.5 

4.4.7 

Intel 4.1.3 

6.0 

Intel Xeon E5-1620 

4 

16 

QDR 

4.6.5 

4.4.7 

Intel 4.1.3 

6.0 

Intel Xeon E5-1650 

6 

16 

QDR 

4.6.5 

4.4.7 

Intel 4.1.3 

6.0 

Intel Xeon E5-2670 

8 

16 

QDR 

4.6.5 

4.4.7 

Intel 4.1.3 

6.0 

Intel Xeon E5-2670v2 

10 

32 

QDR 

4.6.5 

4.4.7 

Intel 4.1.3 

6.0 

Intel Xeon E5-2670v2 x2 

20 

32 

QDR 

4.6.5 

4.4.7 

Intel 4.1.3 

6.0 

Intel Xeon E5-2680v2 x 2 

20 

64 

FDR-14 

4.6.7 

4.8.4 

IBM PE 1.3 

6.5 

Intel Xeon E5-2680v2 x 2 

20 

64 

QDR 

4.6.7 

4.8.3 

Intel 4.1.3 

6.0 

AMD Opteron 6380 x2 

32 

32 

- 

4.6.7 

4.8.3 

- 

6.0 

AMD Opteron 6272 x4 

64 

32 

- 

4.6.7 

4.8.3 

- 

- 


"For benchmarks across multiple nodes. On single nodes, GROMACS’ built-in thread-MPI library was used. 


majority of cases hold for version 5.0 since the performance of CPU and GPU compute kernels 
have not changed substantially. Moreover, as long as compute kernel, threading and heterogeneous 
parallelization design remains largely unchanged, performance characteristics and optimization 
techniques described here will translate to future versions, tooj^ 

If possible, the hardware was tested in the same software environment by booting from a com¬ 
mon software image; on external high performance computing (HPC) centers the provided soft¬ 
ware environment was used. Table [2] summarizes the hard- and software situation for the various 
node types. The operating system was Scientific Linux 6.4 in most cases with the exception of the 
FDR-14 Infiniband-connected nodes that were running SuSE Linux Enterprise Server 11. 

Eor the tests on single nodes, GROMACS was compiled with OpenMP threads and its built-in 
thread-MPI library, whereas across multiple nodes Intel’s or IBM’s MPI library was used. In all 
cases, EETW 3.3.2 was used for computing fast Eourier transformations. This was compiled using 
-enable-sse2 for best GROMACS performance|^ Eor compiling GROMACS, the best possible 
^This is has been verified for version 5.1 which is in beta phase at the time of writing. 

^Although FFTW supports the AVX instruction set, due to limitations in its kernel auto-tuning functionality, en- 
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SIMD vector instruction set implementation was chosen for the CPU architecture in question, i. e. 
128-bit AVX with FMA4 and XOP on AMD and 256-bit AVX on Intel processors. 

GROMACS can be compiled in mixed precision (MP) or in double precision (DP). DP treats 
all variables with double precision accuracy, whereas MP uses single precision (SP) for most vari¬ 
ables, as e. g. the large arrays containing positions, forces, and velocities, but double precision for 
some critical components like accumulation buffers. It was shown that MP does not deteriorate 
energy conservation.^^ Since it produces 1.4-2 x more trajectory in the same compute time, it is 
in most cases preferable over DP.^ Therefore, we used MP for the benchmarking. 


GPU acceleration 

Table 3: Some GPU models that can be used by GROMACS. The upper part of the table lists HPC- 
class Tesla cards, below are the consumer-class GeForce GTX cards. Checkmarks (/) indicate 
which were benchmarked. For the GTX 980 GPUs, cards by different manufacturers differing in 
clock rate were benchmarked and ^ symbols). 


NVIDIA 

architec- 

CUDA 

clock rate 

memory 

SP throughput 

PS price 


model 

ture 

cores 

(MHz) 

(GB) 

(Gflop/s) 

(€) (net) 


Tesla K20X^ 

Kepler GKllO 

2,688 

732 

6 

3,935 

2,800 

/ 

Tesla K40‘‘ 

Kepler GKllO 

2,880 

745 

12 

4,291 

3,100 

/ 

GTX 680 

Kepler GK104 

1,536 

1,058 

2 

3,250 

300 

/ 

GTX 770 

Kepler GK104 

1,536 

1,110 

2 

3,410 

320 

/ 

GTX 780 

Kepler GKllO 

2,304 

902 

3 

4,156 

390 

/ 

GTX 780Ti 

Kepler GKllO 

2,880 

928 

3 

5,345 

520 

/ 

GTX TITAN 

Kepler GKllO 

2,688 

928 

6 

4,989 

750 

/ 

GTX TITAN X 

Maxwell GM200 

3,072 

1,002 

12 

6,156 


- 

Quadro M6000 

Maxwell GM200GL 

3,072 

988 

12 

6,070 


Fig.| 

GTX 970 

Maxwell GM204 

1,664 

1,050 

4 

3,494 

250 


GTX 980 

Maxwell GM204 

2,048 

1,126 

4 

4,612 

430 

/ 

GTX 980+ 

Maxwell GM204 

2,048 

1,266 

4 

5,186 

450 

/ 

GTX 980* 

Maxwell GM204 

2,048 

1,304 

4 

5,341 

450 

/ 


“See Figure|^for how performance varies with clock rate of the Tesla cards, all other benchmarks have been done 
with the base clock rates reported in this table. 


GROMACS 4.6 and later supports CUDA-compatible GPUs with compute capability 2.0 or 
higher. Tablej^lists a selection of modern GPUs (of which all but the GTX 970 were benchmarked) 

abling AVX support deteriorates performance on the tested architectures. 
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including some relevant teehnieal information. The SP eolumn shows the GPU’s maximum the- 
oretieal SP flop rate, ealeulated from the base eloek rate (as reported by NVIDIA’s deviceQuery 
program) times the number of eores times two floating-point operations per eore and eyele. GRO- 
MACS exelusively uses single preeision floating point (and integer) arithmetie on GPUs and ean 
therefore only be used in mixed preeision mode with GPUs. Note that at eomparable theoretieal SP 
flop rate the Maxwell GM204 eards yield a higher effeetive performanee than Kepler generation 
eards due to better instruetion seheduling and redueed instruetion lateneies. 

Sinee the GROMACS CUDA non-bonded kernels are by design strongly eompute-bound,!^ 
GPU main memory performanee has little impaet on their performanee. Henee, peak performanee 
of the GPU kernels ean be estimated and eompared within an arehiteetural generation simply from 
the the produet of eloek rate x eores. SP throughput is ealeulated from the base eloek rate, but 
the effeetive performanee will greatly depend on the aetual sustained frequeney a eard will run 
at, whieh ean be mueh higher. At the same time, frequeney throttling ean lead to performanee 
degradation as illustrated in Figure 

The priee eolumn gives an approximate net original priee of these GPUs as of 2014. In general, 
eards with higher proeessing power (Gflop/s) are more expensive, however the TITAN and Tesla 
models have a signifieantly higher priee due to their higher DP proeessing power (1,310 Gflop/s 
in eontrast to at most 210 Gflop/s for the 780Ti) and their larger memory. Note that unless an 
MD system is exeeptionally large, or many eopies are run simultaneously, the extra memory will 
almost never be used. For MEM, k, 50 MB of GPU memory is needed, and for RIB 225 MB. 
Even an espeeially large MD system eonsisting of ~ 12.5 M atoms uses just about 1,200 MB and 
does therefore still fit in the memory of any of the GPUs found in Table 

Benchmarking procedure 

The benehmarks were run for 2,000- 15,000 steps, whieh translates to a eouple of minutes wall 
eloek runtime for the single-node benehmarks. Balaneing the eomputational load takes mdrun up to 
a few thousand time steps at the beginning of a simulation. Sinee during that phase the performanee 
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is neither stable nor optimal, we exeluded the first 1,000- 10,000 steps from measurements using 
the -resetstep or -resethway eommand line switehes. On nodes without a GPU, we always 
aetivated DLB, since the benefits of a balanced computational load between CPU cores usually 
outweigh the small overhead of performing the balancing (see for example Figure black lines). 
On GPU nodes the situation is not so clear due to the competition between DD and CPU-GPU 
load balancing mentioned in Section We therefore tested both with and without DLB in most of 
the GPU benchmarks. All reported MEM and RIB performances are the average of two runs each, 
with standard deviations on the order of a few percent (see Figurej^for an example of how the data 
scatter). 

Determining the single-node performance 

We aimed to find the optimal command-line settings for each hardware configuration by testing 
the various parameter combinations as mentioned in Sectionj^ On individual nodes with Nq cores, 
we tested the following settings using thread-MPI ranks: 

(a) A^rank — Nq 

(b) a single process with Nih = Nc threads 

(c) combinations of W ank ranks with Wh threads each, with Wank x Wh = W (hybrid paralleliza¬ 
tion) 

(d) forW > 20 without GPU acceleration, we additionally checked with g_tune_pine whether 
separate ranks for the long-range PME part do improve performance 

For most of the hardware combinations, we checked (a) - (d) with and without HT, if the processor 
supports it. The supporting information contains a bash script that automatically performs tests 
(a)-(c). 

To share GPUs among multiple DD ranks, current versions of mdrun require a custom -gpu_id 
string specifying the mapping between PP ranks and numeric GPU identifiers. To obtain opti¬ 
mal launch parameters on GPU nodes, we automated constructing the -gpu_id string based on 
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the number of DD ranks and GPUs and provide the corresponding bash script in the supporting 
information. 

Determining the parallel performance 

To determine the optimal performance across many CPU-only nodes in parallel, we ran g_tune_pine 
with different combinations of ranks and threads. We started with as many ranks as cores Nc in 
total (no threads), and then tested two or more threads per rank with an appropriately reduced 
number of ranks as in (c), with and without HT. 

When using separate ranks for the direct and reciprocal space parts of PME = A^dd +Npme) 
on a cluster of GPU nodes, only the A^dd direct space ranks can make use of GPUs. Setting 
whole nodes aside for the PME mesh calculation would mean leaving their GPU(s) idle. To pre¬ 
vent leaving resources unused with separate PME ranks, we assigned as many direct space (and 
reciprocal space) ranks to each node as there are GPUs per node, resulting in a homogeneous, 
interleaved PME rank distribution. On nodes with two GPUs each, e. g., we placed A^ = 4 ranks 
(Mdd = A^pme = 2) with as many threads as needed to make use of all available cores. The num¬ 
ber of threads per rank may even differ for A^dd and A^pme- In fact, an uneven thread count can 
be used to balance the compute power between the real and the reciprocal ranks. On clusters 
of GPU nodes, we tested all of the above scenarios (a) - (c) and additionally checked whether a 
homogeneous, interleaved PME rank distribution improves performance. 

4 Results 

This section starts with four pilot surveys that assess GPU memory reliability (i), and evaluate 
the impact of compiler choice (ii), neighbor searching frequency (iii), and parallelization settings 
(iv) on the GROMACS performance. Prom the benchmark results and the hardware costs we 
will then derive for various node types how much MD trajectory is produced per invested €. We 
will compare performances of nodes with and without GPUs and also quantify the performance 
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dependence on the GPU application clock setting. We will consider the energy efficiency of several 
node types and show that balanced CPU-GPU resources are needed for a high efficiency. We 
will show how running multiple simulations concurrently maximizes throughput on GPU nodes. 
Finally, we will examine the parallel efficiency in strong scaling benchmarks for a selected subset 
of node types. 

GPU error rates 

Opposed to the GeForce GTX consumer GPUs, the Tesla HPC cards offer error checking and 
correction (ECC) memory. ECC memory, as also used in CPU server hardware, is able to detect 
and possibly correct random memory bit-flips that may rarely occur. While in a worst-case scenario 
such events could lead to silent memory corruption and incorrect simulation results, their frequency 
is extremely low. l^^ l ^^l Prior to benchmarking, we performed extensive GPU stress-tests on a total 
of 297 consumer-class GPUs (Table using tools that test for ‘soft errors’ in the GPU memory 
subsystem and logic using a variety of proven test patterns .1^ Our tests allocated the entire available 
GPU memory and ran for > 4,500 iterations, corresponding to several hours of wall-clock time. 
The vast majority of cards were error-free, but for 8 GPUs, errors were detected. Individual error 
rates differed considerably from one card to another with the largest rate observed for a 780Ti, 
where during 10,000 iterations > 50 Million errors were registered. Here, already the first iteration 
of the memory checker picked up > 1,000 errors. On the other end of the spectrum were cards 
exhibiting only a couple of errors over 10,000 iterations, including the two problematic 980+. 
Error rates were close to constant for each of the four repeats over 10,000 iterations. All cards with 
detected problems were replaced. 

Impact of compiler choice 

The impact of the compiler version on the simulation performance is quantified in Table Erom 
all tested compilers, GCC 4.8 provides the fastest executable on both AMD and Intel platforms. On 
GPU nodes, the difference between the fastest and slowest executable is at most 4%, but without 
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Table 4: Frequency of consumer-class GPUs exhibiting memory errors. 


NVIDIA 

model 

GPU memory 
checkei® 

# of cards 

tested 

# memtest 

iterations 

# cards 

with errors 

GTX 580 

memtestGSO 

1 

10,000 

— 

GTX 680 

memtestGSO 

50 

4,500 

- 

GTX 770 

memtestGSO 

100 

4,500 

- 

GTX 780 

memtestCL 

I 

50,000 

- 

GTX TITAN 

memtestCL 

I 

50,000 

- 

GTX 780Ti 

memtestGSO 

70 

4 X 10,000 

6 

GTX 980 

memtestGSO 

4 

4 X 10,000 

- 

GTX 980+ 

memtestGSO 

70 

4 X 10,000 

2 


GPUs it can reach 20%. Table [5] can also be used to normali z e benchmark results obtained with 
different compilers. 

Impact of neighbor searching frequency 

With the advent of the Verlet cutoff scheme implementation in version 4.6, the neighbor searching 
frequency has become a merely performance-related parameter. This is enabled by the automated 
pair list buffer calculation based on the a maximum error tolerance and a number of simulation 
parameters and properties of the simulated system including search frequency, temperature, atomic 
displacement distribution and the shape of the potential at the cutoff.l^ 

Adjusting this frequency allows trading the computational cost of searching for the compu¬ 
tation of short-range forces. As the GPU is idle during list construction on the CPU, reducing 
the search frequency also increases the average CPU-GPU overlap. Especially in multi-GPU runs 
where DD is done at the same step as neighbor search, decreasing the search frequency can have 
considerable performance impact. Figure|^indicates that the search frequency optimum is between 
20 and 70 time steps. The performance dependence is most pronounced for values < 20, where 
performance quickly deteriorates. In our benchmarks we used a value of 40 on GPU nodes (see 
Table [T]). 
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Table 5: GROMACS 4.6 single-node performance with thread-MPI (and CUDA 6.0) using differ¬ 
ent compiler versions on AMD and Intel hardware with and without GPUs. The last column shows 
the speedup compared to GCC 4.4.7 calculated from the average of the speedups of the MEM and 
RIB benchmarks. 


Hardware 

Compiler 

MEM (n^d) 

RIB (n^d) 

av. speedup (%) 

AMD 6380 X 2 

GCC 4.4.7 

14 

0.99 

0 


GCC 4.7.0 

15.6 

I.II 

11.8 


GCC 4.8.3 

16 

1.14 

14.7 


ICC 13.1 

12.5 

0.96 

-6.9 

AMD 6380 X 2 

GCC 4.4.7 

40.5 

3.04 

0 

with 2x GTX 980+ 

GCC 4.7.0 

38.9 

3.09 

-1.2 


GCC 4.8.3 

40.2 

3.14 

1.3 


ICC 13.1 

39.7 

3.09 

-0.2 

Intel E5-2680v2 x 2 

GCC 4.4.7 

21.6 

1.63 

0 


GCC 4.8.3 

26.8 

1.86 

19.1 


ICC 13.1 

24.6 

1.88 

14.6 


ICC 14.0.2 

25.2 

1.81 

13.9 

Intel E5-2680v2 x 2 

GCC 4.4.7 

61.2 

4.41 

0 

with 2x GTX 980+ 

GCC 4.8.3 

62.3 

4.69 

4.1 


ICC 13.1 

60.3 

4.78 

3.5 
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Figure 2: Impact of neighbor searching frequency on the performance on a node with 2xE5- 
2680v2 processors and 2xK20X GPUs. In the MEM benchmark the number of ranks and threads 
per rank was also varied. 


Influence of hybrid parallelization settings and DLB 

The hybrid (OpenMP / MPI) parallelization approach in GROMACS can distribute computational 
work on the available CPU cores in various ways. Since the MPI and OpenMP code paths exhibit 
a different parallel scaling behaviour,!^ the optimal mix of ranks and threads depends on the used 
hardware and MD system, as illustrated in Figure 

For the CPU-only benchmarks shown in the figure (black lines), pure MPI parallelization yields 
the highest performance, which is often the case on nodes without GPUs (see Fig. 4 in Ref.^^. 
For multi-socket nodes with GPU(s) and for nodes with multiple GPUs, the highest performance 
is usually reached with hybrid parallelism (with an optimum at about 4-5 threads per MPI rank, 
colored curves). The performance differences between the individual parallel settings can be con¬ 
siderable: for the single-GPU setting of the MEM system, e. g., choosing 40 MPI ranks results in 
less than half the performance of the optimal settings, which are 4 MPI ranks and 10 threads each 
(24 ns/d compared to 52 see blue line in Figure [^. The settings at the performance optimum are 
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Figure 3: Single-node performance as a function of the number of GPUs (color coded) and of how 
the 40 hardware threads are exploited using a combination of MPI ranks and OpenMP threads. 
Solid lines show performance with, dotted lines without DLB. Test node had 2x E5-2680v2 
processors and 4x GTX 980^ GPUs. Left panel MEM, right panel RIB benchmark system. 
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provided in the benchmark summary Tables [7| [TT| and[T^ 

As described in Sectionespecially with GPUs, DLB may in some cases cause performance 
degradation. A prominent example is the right plot of Figure where the highest RIB perfor¬ 
mances are recorded without DLB when using GPUs. However, there are also cases where the 
performance is similar with and without DLB, as e. g. in the 4-GPU case of the left plot (light 
blue). 


Fitness of various node types 

Tables and |7] list single-node performances for a diverse set of hardware combinations and the 
parameters that yielded peak performance. ‘DD grid’ indicates the number of DD cells per dimen¬ 
sion, whereas Wth’ gives the number of threads per rank. Since each DD cell is assigned to exactly 
one MPI rank, the total number of ranks can be calculated from the number of DD grid cells as 
■^rank = DD^c X DDy X DD^ plus the number A^pme t)f separate PME ranks, if any. Normally the 
number of physical cores (or hardware threads with HT) is the product of the number of ranks and 
the number of threads per rank. For MPI parallel runs, the DLB column indicates whether peak 
performance was achieved with (symbol /) or without DLB (symbol X) or whether the benchmark 
was done exclusively with enabled DLB (symbol / ). 

The ‘cost’ column for each node gives a rough estimate on the net price as of 2014 and should 
be taken with a grain of salt. Retail prices can easily vary by 15-20% over a relatively short 
period. In order to provide a measure of ‘bang for buck,’ using the collected cost and performance 
data we derive a performance-to-price ratio metric shown in the last column. We normali z e with 
the lowest performing setup to get > 1 values. While this ratio is only approximate, it still provides 
insight into which hardware combinations are significantly more competitive than others. 

When a single CPU with 4-6 physical cores is combined with a single GPU, using only thread¬ 
ing without DD resulted in the best performance. On CPUs with 10 physical cores, peak perfor¬ 
mance was usually obtained with thread-MPI combined with multiple threads per rank. When 
using multiple GPUs, where at least W ank = Aqpu ranks is required, in most cases an even larger 
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Table 6: Single-node performanee P of the MEM benehmark on various node types. U = raek 
spaee requirements in units per node, D for desktop ehassis. Priees do not inelude IB network 
adapter. 


u 

processor(s) 
clock rate 

CPUs X 
cores 

GPUs 

DD grid 

X y z 

NpME 

Nth 

DEB 

p 

(ns/d) 

« cost 
(€ net) 

nj/d per 
205 € 

D 

i7-4770K 

1 X 4 

— 

2 

3 

1 

2 

1 


7.4 

800 

1.9 


(3.4-3.9 GHz) 


980* 

1 

1 

1 

- 

8 

- 

26.1 

1,250 

4.3 

D 

i7-5820K 

1 X 6 

— 

3 

3 

1 

3 

1 

/ 

10.1 

850 

2.4 


(3.3-3.6 GHz) 


770 

1 

1 

1 

- 

12 

- 

26.5 

1,170 

4.6 




980* 

1 

1 

1 

- 

12 

- 

32 

1,390 

4.7 

1 

E3-1270v2 

1 X 4 

— 

1 

1 

1 

_ 

8 


5.3 

1,080 

1 


(3.5 GHz) 


680 

1 

1 

1 


8 

_ 

20 

1,380 

3 




770 

1 

1 

1 

- 

8 

- 

20.5 

1,400 

3 

1 

E5-1620 

1 X 4 

680 

1 

1 

1 


8 


21 

1,900 

2.3 


(3.6-3.8 GHz) 


770 

1 

1 

1 


8 


21.7 

1,900 

2.3 




780 

1 

1 

1 


8 


21.8 

1,970 

2.3 




780Ti 

1 

1 

1 

- 

8 

- 

23.4 

2,100 

2.3 




TITAN 

1 

1 

1 

- 

8 

- 

23.8 

2,330 

2.1 

1 

E5-1650 

1 X 6 

680 

1 

1 

1 

__ 

12 

_ 

22.6 

2,170 

2.1 


(3.2-3.8 GHz) 


770 

1 

1 

1 


12 


23.4 

2,170 

2.2 




780 

1 

1 

1 


12 


25 

2,240 

2.3 




780Ti 

1 

1 

1 


12 


27 

2,370 

2.3 




680x2 

2 

1 

1 

_ 

6 


24.8 

2,470 

2.1 




770x2 

2 

1 

1 

- 

6 


25.1 

2,470 

2.1 

1 

E5-2670 

1 X 8 

770 

1 

1 

1 

_ 

16 

_ 

26.9 

2,800 

2 


(2.6-3.3 GHz) 


780 

1 

1 

1 

_ 

16 

_ 

28.3 

2,870 

2 




780Ti 

1 

1 

1 

- 

16 

_ 

29.6 

3,000 

2 




TITAN 

1 

1 

1 

_ 

16 

_ 

29.3 

3,230 

1.9 




770x2 

2 

1 

1 

- 

8 

i/; 

27.6 

3,120 

1.8 

1 

E5-2670v2 

1 X 10 

_ 

4 

5 

1 

_ 

1 

/ 

11.2 

2,480 

0.9 


(2.5-3.3 GHz) 


770 

1 

5 

1 


4 

k 

29 

2,800 

2.1 




780 

1 

1 

1 

_ 

20 

- 

29.8 

2,870 

2.1 




780Ti 

1 

5 

1 

— 

4 

(/' 

31.5 

3,000 

2.2 




TITAN 

1 

1 

1 


20 


32.7 

3,230 

2.1 




980 

1 

5 

1 


4 


33.6 

2,900 

2.4 




770x2 

10 

1 

1 

_ 

2 

/ 

33.7 

3,120 

2.2 




780Tix2 

10 

1 

1 


2 

/ 

35.7 

3,520 

2.1 




980x2 

10 

1 

1 

- 

2 

/ 

36.8 

3,330 

2.3 

4 

E5-2670v2 

2 X 10 

_ 

8 

5 

1 

_ 

1 

/ 

21.4 

3,360 

1.3 


(2.5-3.3 GHz) 


770 

8 

1 

1 


5 

X 

35.9 

3,680 

2 




770x2 

10 

1 

1 


4 

/ 

51.7 

4,000 

2.6 

2 



780Ti 

8 

1 

1 


5 

(/ 

45.5 

4,100 

2.3 




780Tix2 

10 

1 

1 


4 

/ 

56.9 

4,620 

2.5 




780Tix3 

10 

1 

1 

_ 

4 

/ 

61.1 

5,140 

2.4 




780Tix4 

10 

1 

1 

- 

4 

/ 

64.4 

5,660 

2.3 

2 

E5-2680v2 

2 X 10 


8 

2 

2 

8 

1 

/ 

26.8 

4,400 

1.2 


(2.8-3.6 GHz) 


K20Xx2 

8 

1 

1 


5 

X 

55.2 

10,000 

1.1 




K40x2 

8 

1 

1 


5 

X 

55.9 

10,600 

1.1 




980+ 

4 

1 

1 


10 

/ 

52 

4,850 

2.2 




980+ x2 

10 

1 

1 

- 

4 

/ 

62.3 

5,300 

2.4 




980+ x3 

10 

1 

1 

_ 

4 

/ 

65.1 

5,750 

2.3 




980+ x4 

8 

1 

1 

- 

5 

X 

66.9 

6,200 

2.2 

1 

AMD 6272 (2.1) 

4 X 16 

- 

5 

5 

2 

14 

1 

/ 

23.7 

3,670 

1.3 

4 

AMD 6380 

2 X 16 

_ 

5 

5 

1 

7 

1 

/ 

16 

2,880 

1.1 


(2.5 GHz) 


TITAN 

8 

1 

1 


4 

X 

32.5 

3,630 

1.8 




770x2 

8 

1 

1 

_ 

4 

/ 

35.8 

3,520 

2.1 




980+ 

8 

1 

1 

_ 

4 

X 

35.6 

3,330 

2.2 




980+ x2 

8 

1 

1 

- 

4 

X 

40.2 

3,780 

2.2 
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Table 7: Same as Tablebut for the RIB benehmark. 


u 

processor(s) 
clock rate 

CPUs X 
cores 

GPUs 

DD grid 

X y z 

NpME 

Xh 

DEB 

p 

(nfd) 

« cost 
(€ net) 

nfd per 
3600 € 

D 

i7-4770K 

1 X 4 

— 

2 

3 

1 

— 

1 

/' 

0.51 

800 

2.3 


(3.4-3.9 GHz) 


980* 

8 

1 

1 

- 

1 

k 

1.3 

1,250 

3.7 

D 

i7-5820K 

1 X 6 

_ 

3 

3 

1 

3 

1 

/ 

0.69 

850 

2.9 


(3.3-3.6 GHz) 


770 

4 

1 

1 

- 

3 

- 

1.54 

1,170 

4.7 




980* 

1 

1 

1 

- 

12 

- 

1.8 

1,390 

4.7 

1 

E3-1270v2 

1 X 4 

_ 

1 

1 

1 

_ 

8 


0.3 

1,080 

1 


(3.5 GHz) 


680 

1 

1 

1 

_ 

8 

- 

0.89 

1,380 

2.3 




770 

1 

1 

1 

- 

8 

- 

0.91 

1,400 

2.3 

1 

E5-1620 

1 X 4 

680 

1 

1 

1 

_ 

8 

_ 

1.03 

1,900 

2 


(3.6-3.8 GHz) 


770 

1 

1 

1 

_ 

8 

- 

1.02 

1,900 

1.9 




780 

1 

1 

1 

- 

8 

- 

1.06 

1,970 

1.9 




780Ti 

1 

1 

1 

- 

8 

- 

1.14 

2,100 

2 




TITAN 

1 

1 

1 

- 

8 

- 

1.11 

2,330 

1.7 

1 

E5-1650 

1 X 6 

680 

1 

1 

1 


12 

_ 

1.09 

2,170 

1.8 


(3.2-3.8 GHz) 


770 

1 

1 

1 

- 

12 

_ 

1.13 

2,170 

1.9 




780 

1 

1 

1 

_ 

12 

_ 

1.17 

2,240 

1.9 




780Ti 

I 

1 

1 

_ 

12 

_ 

1.22 

2,370 

1.9 




680x2 

2 

1 

1 

— 

6 

/; 

1.4 

2,470 

2 




770x2 

2 

1 

1 

- 

6 

/' 

1.41 

2,470 

2.1 

1 

E5-2670 

1 X 8 

770 

1 

1 

1 

_ 

16 

_ 

1.39 

2,800 

1.8 


(2.6-3.3 GHz) 


780 

8 

1 

1 

_ 

2 

/ 

1.6 

2,870 

2 




780Ti 

1 

1 

1 

_ 

16 

_ 

1.64 

3,000 

2 




TITAN 

4 

1 

1 

_ 

4 

/' 

1.67 

3,230 

1.9 




770x2 

2 

1 

1 

- 

8 

/ 

1.72 

3,120 

2 

1 

E5-2670v2 

1 X 10 


4 

2 

2 

4 

1 

/ 

0.79 

2,480 

1.1 


(2.5-3.3 GHz) 


770 

10 

1 

1 

- 

2 

k 

1.78 

2,800 

2.3 




780 

1 

1 

1 

- 

20 

- 

1.6 

2,870 

2 




780Ti 

5 

1 

1 

— 

4 

/ 

2.06 

3,000 

2.5 




TITAN 

1 

1 

1 

- 

20 

_ 

1.75 

3,230 

2 




980 

5 

1 

1 

— 

4 

/ 

2.22 

2,900 

2.8 




770x2 

10 

1 

1 

- 

2 

k 

2.16 

3,120 

2.5 




780Tix2 

4 

1 

1 

_ 

5 

/' 

2.31 

3,520 

2.4 




980x2 

5 

1 

1 

- 

4 

'/) 

2.34 

3,330 

2.5 

4 

E5-2670v2 

2 X 10 

_ 

8 

2 

2 

8 

1 

/ 

1.54 

3,360 

1.7 


(2.5-3.3 GHz) 


770 

20 

1 

1 

- 

2 

k 

2.71 

3,680 

2.7 




770x2 

8 

5 

1 

_ 

1 

k 

3.41 

4,000 

3.1 

2 



780Ti 

8 

5 

1 


1 

/ 

3.3 

4,100 

2.9 




780Tix2 

8 

1 

1 


5 

k 

4.02 

4,620 

3.1 




780Tix3 

8 

5 

1 

_ 

1 

/' 

4.17 

5,140 

2.9 




780Tix4 

8 

5 

1 

- 

1 

/ 

4.17 

5,660 

2.7 

2 

E5-2680v2 

2 X 10 


10 

3 

1 

10 

1 

/ 

1.86 

4,400 

1.5 


(2.8-3.6 GHz) 


K20Xx2 

20 

1 

1 


2 

k 

3.99 

10,000 

1.4 




K40x2 

20 

1 

1 

- 

2 

k 

4.09 

10,600 

1.4 




980+ 

20 

1 

1 

_ 

2 

k 

3.99 

4,850 

3 




980+ x2 

20 

1 

1 

_ 

2 

k 

4.69 

5,300 

3.2 




980+ x3 

20 

1 

1 

_ 

2 

k 

4.85 

5,750 

3 




980+ x4 

20 

1 

1 

- 

2 

k 

4.96 

6,200 

2.9 

1 

AMD 6272 (2.1) 

4 X 16 

- 

5 

5 

2 

14 

1 

/ 

1.78 

3,670 

1.7 

4 

AMD 6380 

2 X 16 

_ 

5 

5 

1 

7 

1 

/ 

1.14 

2,880 

1.4 


(2.5 GHz) 


TITAN 

16 

2 

1 

- 

1 

k 

2.58 

3,630 

2.6 




770x2 

16 

1 

1 

- 

2 

k 

2.74 

3,520 

2.8 




980+ 

16 

1 

1 

_ 

2 

k 

2.81 

3,330 

3 




980+ x2 

16 

1 

1 

- 

2 

k 

3.14 

3,780 

3 
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number of ranks (multiple ranks per GPU) was optimal. 


Speedup with GPUs 

Tablesand |7] show that GPUs inerease the performance of a compute node by a factor of 1.7 - 3.8. 
In case of the inexpensive GeForce consumer cards, this also reflects in the node’s performance- 
to-price ratio, which increases by a factor of 2-3 when adding at least one GPU (last column). 
When installing a significantly more expensive Tesla GPU, the performance-to-price ratio is nearly 
unchanged. Because both the performance itself (criterion C2, as defined in the introduction) as 
well as the performance-to-price ratio (Cl) are so much better for nodes with consumer-class 
GPUs, we focused our efforts on nodes with this type of GPU. 

When looking at single-CPU nodes with one or more GPUs (see third column of Tables and 
1^, the performance benefit obtained by a second GPU is < 20% for the 80 k atom system (but 
largest on the 10-core machine), and on average about 25 % for the 2 M atom system, whereas the 
performance-to-price ratio is nearly unchanged. 

The dual-GPU, dual-socket E5-2670v2 nodes are like the single-GPU, single-socket E5-2670v2 
nodes with the hardware of two nodes combined. The dual-CPU nodes with several GPUs yielded 
the highest single-node performances of all tested nodes, up to ~ 67 for MEM and 5 for 
RIB on the E5-2680v2 nodes with four GTX 980+. The performance-to-price ratio (Cl) of these 
20-core nodes seems to have a sweet spot at two installed GPUs. 

GPU application clock settings 

Eor the Tesla K20X and K40 cards we determined how application clock settings influence the 
simulation performance (as mentioned previously, GeEorce cards do not support manual adjust¬ 
ment of the clock frequency). While the default clock rate of the K20X is 732 MHz, it supports 
seven clock rates in the range of 614-784MHz. The K40 defaults to 745 MHz and supports four 
rates in the range of 666-875 MHz. Application clock rates were set using NVIDIA’s system 
management interface tool. E. g., nvidia-smi -ac 2600,875 -i 0 sets the GPU core clock rate to 


24 



GPU MHz 



GPU MHz 


Figure 4: Performance as a function of the GPU application clock rate on a node with 2xE5- 
2680v2 processors and K20x (dark blue, dark green) or K40 (light blue, light green) GPUs. Grey 
vertical lines indicate default clock rates. MEM, circles (RIB, stars) benchmarks were run using 
the settings found in Table (Table |^. 


875 MHz and the memory clock rate to 2600 MHz on interface 0. 

Fig. 1^ shows measured performances as a function of the clock rate as well as linear fits (lines). 
For the K40, the maximum clock rate is about 17% higher than the default, and performance 
increases about 6.4% when switching from the default to the maximum frequency using a single 
GPU. The maximum clock rate of the K20X is about 7% higher than the default, resulting in a 
2.8% performance increase. For two K40 or K20X GPUs, using the maximum clock rate only 
results in a 2.1% increased performance, likely because this hardware/benchmark combination is 
not GPU-bound. Highest performance is in all cases reached with the GPU application clock set 
to the highest possible value. 

Energy efficiency 

For a given CPU model, the performanceAVatts ratio usually decreases with increasing clock rate 
due to disproportionately higher energy dissipation. CPUs with higher clock rates are therefore 
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Table 8: Electric power consumption for nodes with up to 4 GPUs when running the RIB bench¬ 
mark. Assuming 5 years of continuous operation and a price of 0.2 € per kWh including cooling, 
the yield in produced trajectory per invested 1,000 € is given in the last column. 


CPU 

installed 

RIB 

power 

energy 

node 

5 yr yield 

cores 

GPUs 

(n>/d) 

draw (W) 

costs (€) 

costs (€) 

(ns/k€) 

E5-2670v2 

- 

(node idle) 

120 

1,051 

3,360 


2 X 10 c. 

- 

1.38 

252 

2,208 

3,360 

453 

2.5-3.3 GHz 

780Ti 

3.3 

519 

4,546 

3,880 

714 

(GCC 4.4.7) 

780Ti X 2 

3.87 

666 

5,834 

4,400 

690 


780Ti X 3 

4.17 

933 

8,173 

5,430 

559 


780Ti X 4 

4.17 

960 

8,410 

5,950 

530 


980 

3.68 

408 

3,574 

3,780 

914 


980 X 2 

4.18 

552 

4,836 

4,200 

844 


980 X 3 

4.2 

696 

6,097 

5,130 

683 


980 X 4 

4.2 

840 

7,358 

5,550 

594 

E5-2680v2 

— 

(node idle) 

150 

1,314 

4,400 


2 X 10 c. 

- 

1.86 

446 

3,907 

4,400 

408 

2.8-3.6 GHz 

980+ 

3.99 

622 

5,449 

4,850 

707 

(GCC 4.8.3) 

980+ X 2 

4.69 

799 

6,999 

5,300 

696 


980+ X 3 

4.85 

926 

8,112 

5,750 

638 


980+ X 4 

4.96 

1,092 

9,566 

6,200 

574 


both more expensive and less energy efficient. On the GPU side, there has been a redesign for en¬ 
ergy efficiency with the Maxwell architecture, providing significantly improved performanceAVatt 
compared to Kepler generation cards. 

For several nodes with decent performance-to-price ratios from Tables [^-|7] we determined the 
energy efficiency by measuring the power consumption when running the benchmarks at optimal 
settings (Tables and |^. On the E5-2670v2 nodes, we measured the average power draw over 
an interval of 300 seconds using a Voltcraft ‘EnergyCheck 3000’ meter. On the E5-2680v2 nodes, 
the current energy consumption was read from the power supply with ipmitool|^ We averaged 
over 100 readouts one second apart each. Power measurements were taken after the initial load 
balancing phase. For the power consumption of idle GPUs, we compared the power draw of idle 
nodes with and without four installed cards, resulting in 27 W (~ 24 W) for a single idle 780Ti 

'http;//ipmitool.sourceforge.net/ 
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Table 9: As Table but for the MEM benchmark 


CPU 

installed 

MEM 

power 

energy 

node 

5 yr yield 

cores 

GPUs 

(n^d) 

draw (W) 

costs (€) 

costs (€) 

(/rs/k€) 

E5-2680v2 

- 

26.8 

446 

3,907 

4,400 

5.89 

2 X 10 c. 

980+ 

52.05 

547 

4,792 

4,850 

9.85 

2.8-3.6 GHz 

980+ X 2 

62.34 

725 

6,351 

5,300 

9.77 

(GCC 4.8.3) 

980+ X 3 

65.1 

824 

7,218 

5,750 

9.16 


980+ X 4 

66.92 

899 

7,875 

6,200 

8.68 


(980). 

While the total power consumption of nodes without GPUs is lowest, their trajectory costs are 
the highest due to the very low trajectory production rate. Nodes with one or two GPUs produce 
about 1.5-2x as much MD trajectory per invested € than CPU-only nodes (see last column in 
Tables and 1^. While trajectory production is cheapest with one or two GPUs, due to the runs 
becoming CPU-bound, the cost rises significantly with the third or fourth card, though it does 
not reach the CPU-only level. To measure the effect of GPU architectural change on the energy 
efficiency of a node, the E5-2670v2 node was tested both with GTX 780Ti (Kepler) and GTX 
980 (Maxwell) cards. When equipped with 1-3 GPUs, the node draws >100 W less power under 
load using Maxwell generation cards than with Kepler. This results in about 20% reduction of 
trajectory costs, lowest for the node with two E5-2670v2 CPUs combined with a single GTX 
980 GPU. Exchanging the E5-2670v2 with E5-2680v2 CPUs, which have f»10% higher clock 
frequency, yields a 52% (44%) increase in energy consumption and 30% (21%) higher trajectory 
costs for the case of one GPU (two GPUs). 

Well-balanced CPU / GPU resources are crucial 

With GPUs, the short-range pair interactions are offloaded to the GPU, while the calculation of 
other interactions like bonded forces, constraints, and the PME mesh, remains on the CPU. To 
put all available compute power to optimum use, GROMACS balances the load between CPU and 
GPU by shifting as much computational work as possible from the PME mesh part to the short- 
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Table 10: Dependence of simulation performance P, cutoff settings, and total power consumption 
on the graphics processing power for the RIB system on a node with 2x 2680v2 CPUs and up 
to 4 GTX 980+ GPUs. The ‘cost ratios’ indicate the floating point operations in this part of the 
calculation relative to the CPU-only case. 


installed 

GPUs 

P 

(n^d) 

tot. power 
draw (W) 

cutoff 

(nm) 

cost ratio 
short range 

cost ratio 

PME 3D EET 

energy efficiency 
(W/ns/d) 

0 

1.86 

446 

1 

1 

1 

240 

1 

3.99 

622 

1.16 

2.6 

0.65 

156 

2 

4.69 

799 

1.38 

3.75 

0.46 

170 

3 

4.85 

926 

1.45 

4.22 

0.42 

191 

4 

4.96 

1,092 

1.61 

5.36 

0.36 

220 


range electrostatic kernels. As a consequence of the offload approach, the achievable performance 
is limited by the time spent by the CPU in the non-overlapping computation where the GPU is left 
idle, like constraints calculation, integration, neighbor search and domain-decomposition. 


Table 10 shows the distribution of the PME and short-range non-bonded workload with in¬ 
creasing graphics processing power. Adding the first GPU relieves the CPU from the complete 
short-range non-bonded calculation. Additionally, load balancing shifts work from the PME mesh 
(CPU) to the non-bonded kernels (GPU), so that the CPU spends less time in the PME 3D EET cal¬ 
culation. Both effects yield a 2.1 x higher performance compared to the case without a GPU. The 
benefit of additional GPUs is merely the further reduction of the PME 3D EET workload (which 
is just part of the CPU workload) by a few extra percent. The third and fourth GPU only reduce 
the CPU workload by a tiny amount, resulting in a few percent extra performance. At the same 
time, the GPU workload is steadily increased, and with it increases the GPU power draw, reflect¬ 
ing in a significantly increased power consumption of the node (see also the other 3- and 4-GPU 
benchmarks in Table [^. 

Hence, GPU and CPU resources should always be chosen in tandem keeping in mind the needs 
of the intended simulation setup. More or faster GPUs will have little effect when the bottleneck 
is on the CPU side. The theoretical peak throughput as listed in Table helps to roughly relate 
different GPU configurations to each other in terms of how much SP compute power they provide 
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and of how much CPU compute power is needed to achieve a balanced hardware setup. At the 
lower end of GPU models studied here are the Kepler GK104 cards: GTX 680 and 770. These are 
followed by GKl 10 cards, in order of increasing compute power, GTX 780, K20X, GTX TITAN, 
K40, GTX 780Ti. The GTX 980 and TITAN X, based on the recent Maxwell architecture, are 
the fastest as well as most power-efficient GPUs tested. The dual-chip server-only Tesla K80 can 
provide even higher performance on a single board. 

The performance-to-price ratios presented in this study reflect the characteristics of the specific 
combinations of hardware setups and workloads used in our benchmarks. Different types of simu¬ 
lations or input setups may expose slightly different ratios of CPU to GPU workload. E. g., when 
comparing a setup using the AMBER force-field with 0.9 nm cutoffs to a CHARMM setup with 
1.2 nm cutoffs and switched van der Waals interactions, the latter results in a larger amount of pair 
interactions to be computed, hence more GPU workload. This in turn leads to slight differences in 
the ideal CPU-GPU balance in these two cases. Still, given the representative choices of hardware 
configurations and simulation systems, our results allow for drawing general conclusions about 
similar simulation setups. 

Multi-simulation throughput 

Our general approach in this study is using single-simulation benchmarks for hardware evaluation. 
However, comparing the performance P on a node with just a few cores to P on a node with many 
cores (and possibly several GPUs) is essentially a strong scaling scenario involving efficiency 
reduction due to MPI communication overhead and/or lower multi-threading efficiency. 

This is alleviated by partitioning available processor cores between multiple replicas of the 
simulated system, which e. g. differ in their starting configuration. Such an approach is generally 
useful if average properties of a simulation ensemble are of interest. With several replicas, the 
parallel efficiency is higher, since each replica is distributed to fewer cores. A second benefit is 
a higher GPU utilization due to GPU sharing. As the individual replicas do not run completely 
synchronized, the fraction of the time step that the GPU is normally left idle is used by other 
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Figure 5: Maximizing throughput by running multiple simulations per node. A: 
Single-simulation performance P of the MEM benchmark on a node with 2xE5- 
2680v2 CPUs using 0, 1, or 2 GTX 980+ GPUs (blue colors) compared to the ag¬ 
gregated performance of 5 replicas (red/black). B: Similar to A, but for differ¬ 
ent node types and benchmark systems (available at http://www.gromacs.org/gpu and 
ftp://ftp.gromacs.org/pub/CRESTA/CRESTA_Gromacs_benchmarks_v2.tgz). GEC — 144k 
atoms GluCE CRESTA benchmark, 1 nm cutoffs, PME grid spacing 0.12 nm. RNA - 14.7 k atoms 
solvated RNAse, 0.9nm cutoffs, PME grid spacing 0.1125nm. VIE - 8k atoms villin protein, 
1 nm cutoffs, PME grid spacing 0.125 nm. In B a 5 fs time step and GROMACS 5.0.4 was used. 


replicas. The third benefit, similar to the case of GPU sharing by ranks of a single simulation, is 
that independent simulations benefit from GPU task overlap if used in conjunction with CUDA 
MPS. In effect, CPU and GPU resources are both used more efficiently, at the expense of getting 
multiple shorter trajectories instead of a single long one. 

Eigure quantifies this effect for small to medium MD systems. Subplot A compares the 
the MEM performance for a single-simulation (blue colors) to the aggregated performance of five 
replicas (red/black). The aggregated trajectory production of a multi-simulation is the sum of the 
produced trajectory lengths of the individual replicas. The single simulations settings are found in 
Table in multi-simulations we used one rank with 40 / Vr anv threads per replica. Eor a single 
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GTX 980, the aggregated performanee of a 5-repliea simulation (red bar) is 47% higher than the 
single simulation optimum. While there is a performanee benefit of ;^25% already for two replieas, 
the effeet is more pronouneed for > 4 replieas. For two 980 GPUs, the aggregated performanee of 
five replieas is 40% higher than the performanee of a single simulation at optimal settings or 87% 
higher when eompared to a single simulation at default settings (//rank = 2, N± = 20). 

Subplot B eompares single and multi-simulation throughput for MD systems of different size 
for an oetaeore Intel (blue bars) and a 16-eore AMD node (green bars). Here, within eaeh repliea 
we used OpenMP threading exelusively, with the total number of threads being equal to the number 
of eores of the node. The benefit of multi-simulations is always signifieant and more pronouneed 
the smaller the MD system. It is also more pronouneed on the AMD Opteron proeessor as eom¬ 
pared to the Core i7 arehiteeture. For the 8 k atom VIL example, the performanee gain is nearly a 
faetor of 2.5 on the 16-eore AMD node. 

As with multi-simulations one essentially shifts resouree use from strong sealing to the embar¬ 
rassingly parallel sealing regime, the benefits inerease the smaller the input system, the larger the 
number of CPU eores per GPU, and the worse the single-simulation CPU-GPU overlap. 

Seetion ?? in the supporting information gives examples of multi-simulation setups in GRO- 
MACS and additionally quantifies the performanee benefits of multi-simulations aeross many 
nodes eonneeted by a fast network. 

Strong scaling 

The performanee P aeross multiple nodes is given in Tables [^-[^ for seleeted hardware eon- 
figurations. The parallel effieieney E is the performanee on m nodes divided by m times the per¬ 
formanee on a single node: = Pm/{tn x Pi). In the spirit of pinpointing the highest possible 

performanee for eaeh hardware eombination, the multi-node benehmarks were done with a stan¬ 
dard MPI library, whereas on individual nodes the low-overhead and therefore faster thread-MPI 
implementation was used. This results in a more pronouneed drop in parallel effieieney from a 
single to many nodes than what would be observed when using a standard MPI library throughout. 
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The prices in Tables |^-|7] do neither include an Infiniband (IB) network adapter nor proportionate 
costs for an IB switch port. Therefore, the performance-to-price ratios are slightly lower for nodes 
equipped for parallel operation as compared to the values in the tables. However, the most impor¬ 
tant factor limiting the performance-to-price ratio for parallel operation is the parallel efficiency 
that is actually achieved. 

The raw performance of the MEM system can exceed 300 on state-of-the-art hardware, 
and also the bigger RIB system exceeds 200 n^d. This minimizes the time-to-solution, however at 
the expense of the parallel efficiency E (last column). Using activated DLB and separate PME 
ranks yielded the best performance on CPU-only nodes throughout. With GPUs the picture is a bit 
more complex. On large node counts, a homogeneous, interleaved PME rank distribution showed 
a significantly higher performance than without separate PME ranks. DEB was beneficial only 
for the MEM system on small numbers of GPU nodes. HT helped also across several nodes in 
the low- to medium-scale regime, but not when approaching the scaling limit. The performance 
benefits from HT are largest on individual nodes and in the range of 5 - 15%. 

The E3-1270v2 nodes with QDR IB exhibit an unexpected, erratic scaling behaviour (see Tdi- 
bles [TT]-[T^ top rows). The parallel efficiency is not decreasing strictly monotonic, as one would 
expect. The reason could be the CPU’s limited number of 20 PCIe lanes, of which 16 are used 
by the GPU, leaving only 4 for the IB adapter. However, the QDR IB adapter requires 8 PCIe 2.0 
lanes to exploit the full QDR bandwidth. This was also verified in an MPI bandwidth test between 
two of these nodes (not shown). Thus, while the E3-1270v2 nodes with GPU offer an attractive 
performance-to-price ratio, they are not well suited for parallel operation. Intel’s follow-up model, 
the E3-1270v3 provides only 16 PCIe lanes, just enough for a single GPU. Eor parallel usage, the 
processor models of the E5-16x0, E5-26x0 and E5-26x0v2 are better suited as they offer 40 PCIe 
lanes, enough for two GPUs plus IB adapter. 
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Table 11: Scaling of the MEM benchmark on different node types with performance P and par¬ 
allel efficiency E. A black ‘ht’ symbol indicates that using all hyper-threading cores resulted in 
the fastest execution, otherwise using only the physical core count was more advantageous. A 
grey ‘ in ’ denotes that this benchmark was done only with the hyper-threading core count (= 2x 
physical). 


No. of 

nodes 

processor(s) 

Intel 

GPUs, 

Infiniband 

DD grid 

X y z 

A^pme / 
node 

Mh 


DEB 

p 

E 

1 

E3-1270v2 

770, 

1 

1 

1 

- 

8 

(htt 

— 

20.5 

1 

2 

(4 cores) 

QDR^ 

2 

1 

1 

- 

8 

(ht) 

:/) 

21.2 

0.66 

4 



4 

1 

1 

- 

8 

(ht) 

:/) 

22.1 

0.27 

8 



8 

1 

1 

- 

8 

(ht) 

(/; 

68.3 

0.42 

16 



16 

1 

1 

- 

8 

(ht) 

;/) 

85.7 

0.26 

32 



8 

4 

1 

- 

8 

tht) 


119 

0.18 

1 

E5-1620 

680, 

1 

1 

1 

— 

8 

ht 

— 

21 

1 

2 

(4 cores) 

QDR 

2 

1 

1 

- 

8 

ht 

'/ 

29 

0.69 

4 



4 

1 

1 

- 

8 

ht 

/ 

46.9 

0.56 

1 

E5-2670v2 

780Tix2, 

10 

1 

1 

— 

4 

ht 

/ 

56.9 

1 

2 

(2x 10 cores) 

QDR 

4 

5 

1 

- 

2 


/ 

74.2 

0.65 

4 



8 

1 

1 

2 

5 


X 

103.4 

0.45 

8 



8 

1 

2 

2 

5 


X 

119.1 

0.26 

16 



8 

4 

1 

2 

5 


X 

164.8 

0.18 

32 



8 

8 

1 

2 

5 


X 

193.1 

0.11 

1 

E5-2670v2 

980x2, 

10 

1 

1 

— 

4 

ht 

vA 

58 

1 

2 

(2x 10 cores) 

QDR 

4 

5 

1 

- 

2 



75.6 

0.65 

4 



8 

5 

1 

- 

2 


X 

96.6 

0.42 

1 

E5-2680v2 

— 

8 

2 

2 

8 

1 

ht 

/ 

26.8 

1 

2 

(2x 10 cores) 

EDR-14, 

4 

5 

3 

10 

1 

ht 

/ 

42 

0.78 

4 



8 

5 

3 

10 

1 

ht 

/ 

76.3 

0.71 

8 



8 

7 

2 

6 

2 

ht 

/ 

122 

0.57 

16 



8 

8 

4 

4 

1 


/ 

162 

0.38 

32 



8 

8 

8 

4 

1 


/ 

209 

0.24 

64 



10 

8 

6 

2.5 

2 


/ 

240 

0.14 

1 

E5-2680v2 

K20Xx2 

8 

1 

1 

— 

5 

ht 

/ 

55.2 

1 

2 

(2x 10 cores) 

(732 MHz), 

4 

5 

1 

- 

4 

ht 

/ 

74.5 

0.67 

4 


EDR-14 

8 

1 

2 

- 

5 


/ 

118 

0.53 

8 



8 

1 

2 

2 

5 


X 

163 

0.37 

16 



8 

4 

1 

2 

5 


X 

226 

0.26 

32 



8 

8 

1 

2 

5 


X 

304 

0.17 


"Note; these nodes cannot use the full QDR IB bandwidth due to insufficient number of PCI Express (PCIe) lanes, 
see p. 
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Table 12: Same as Table [TTj but for the RIB benehmark. 


No. of 

nodes 

processor(s) 

Intel 

GPUs, 

Infiniband 

DD grid 

X y z 

NpME / 
node 

Mh 


DEB 

p 

(ni^d) 

E 

1 

E3-1270v2 

770, 

1 

1 

1 

— 

8 

ffit) 

— 

0.91 

1 

2 

(4 cores) 

QDR^ 

2 

1 

1 

- 

8 

.ht) 

(/) 

1.87 

1.03 

4 



4 

1 

1 

- 

8 

iht) 

(^) 

2.99 

0.82 

8 



8 

1 

1 

- 

8 

ffit) 

(/) 

4.93 

0.68 

16 



16 

1 

1 

- 

8 

ffit) 

(/) 

4.74 

0.33 

32 



16 

2 

1 

- 

8 

.ht) 

(/) 

10.3 

0.35 

1 

E5-2670v2 

780Tix2, 

8 

1 

1 

- 

5 

■ht) 

X 

4.02 

1 

2 

(2x 10 cores) 

QDR 

20 

1 

1 

- 

4 

ht 

X 

6.23 

0.77 

4 



8 

5 

1 

- 

4 

ht 

X 

10.76 

0.67 

8 



16 

10 

1 

- 

2 

ht 

X 

16.55 

0.51 

16 



16 

10 

1 

- 

2 


X 

23.78 

0.37 

32 



16 

10 

2 

- 

2 


X 

33.51 

0.26 

1 

E5-2670v2 

980x2, 

8 

5 

1 

— 

1 

ht 

■V) 

4.18 

1 

2 

(2x 10 cores) 

QDR 

20 

1 

1 

- 

4 

ht 

X 

6.6 

0.79 

4 



8 

5 

1 

- 

4 

ht 

X 

11 

0.66 

1 

E5-2680v2 

— 

10 

3 

1 

10 

1 

ht 

/ 

1.86 

1 

2 

(2x 10 cores) 

EDR-14 

10 

3 

1 

5 

2 

ht 

/ 

3.24 

0.87 

4 



10 

2 

3 

5 

2 

ht 

/ 

6.12 

0.82 

8 



8 

5 

3 

5 

2 

ht 

/ 

12.3 

0.83 

16 



10 

8 

3 

5 

2 

ht 

/ 

21.8 

0.73 

32 



10 

7 

7 

4.69 

2 

ht 

/ 

39.4 

0.66 

64 



16 

10 

6 

5 

1 


/ 

70.7 

0.59 

128 



16 

16 

8 

4 

1 


/ 

128 

0.54 

256 



16 

17 

15 

4.06 

1 


/ 

186 

0.39 

512 



20 

16 

13 

1.88 

2 


X 

208 

0.22 

1 

E5-2680v2 

K20Xx2 

20 

1 

1 

— 

2 

ht 

X 

3.99 

1 

2 

(2x 10 cores) 

(732 MHz), 

10 

8 

1 

- 

1 

ht 

X 

5.01 

0.63 

4 


EDR-14 

10 

8 

1 

- 

2 

ht 

X 

9.53 

0.6 

8 



16 

10 

1 

- 

2 

ht 

X 

16.2 

0.51 

16 



16 

10 

1 

- 

2 


X 

27.5 

0.43 

32 



8 

8 

1 

2 

5 


X 

49.1 

0.38 

64 



16 

8 

1 

2 

5 


X 

85.3 

0.33 

128 



16 

16 

1 
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"Note: these nodes cannot use the full QDR IB bandwidth due to insufficient number of PCIe lanes, see p. 32 


34 












performance (ns/d) 

Figure 6: Benchmark performances in relation to the total hardware investment (net) for invest¬ 
ments up to 10,000 €. MEM (circles) and RIB (stars) symbols colored depending on CPU type. 
Symbols with white fill denote nodes without GPU acceleration. Dotted lines connect GPU nodes 
to their CPU-only counterparts. The grey lines indicate constant performance-to-price ratio, they 
are a factor of 2 apart each. For this plot, all benchmarks not done with GCC 4.8 (see Table[^ have 
been renormalized to the performance values expected for GCC 4.8, i. e. plus i^l9% for GCC 4.7 
benchmarks on CPU nodes and plus ^4% for GCC 4.7 benchmarks on GPU nodes (see Table [^. 
The costs for multiple node configurations include 370 € for QDR IB adapters (600 € per FDR-14 
IB adapter) per node. 


5 Discussion 

A consequence of offloading the short-ranged non-bonded forces to graphics card(s) is that per¬ 
formance depends on the ratio between CPU and GPU compute power. This ratio can therefore 
be optimized, depending on the requirements of the simulation systems. Respecting that, for any 
given CPU configuration there is an optimal amount of GPU compute power for most economic 
trajectory production, which depends on energy and hardware costs. 

Figures and |7] relate hardware investments and performance, thus summarizing the results in 
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Figure 7: Same representation as in Fig. [^(yellow box depicts section plotted there), now focusing 
on the parallel performance across multiple nodes (the small number next to the data points indi¬ 
cates the number of nodes used). The grey lines indicate perfect scaling and constant performance- 
to-price ratio, they are a factor of 2 apart each. A number next to a data point indicates how many 
compute nodes were used in that benchmark. 
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terms of our criteria performance-to-price (Cl), single-node performance (C2), and parallel perfor¬ 
mance (C3). The grey lines indicate both perfect parallel scaling as well as a constant performance- 
to-price ratio; configurations with better ratios appear more to the lower right. Perhaps not unex¬ 
pectedly, the highest single-node performances (C2) are found on the dual-CPU nodes with two or 
more GPUs. At the same time, the best performance-to-price ratios (Cl) are achieved for nodes 
with consumer-class GPUs. The set of single nodes with consumer GPUs (filled symbols in the 
figures) is clearly shifted towards higher performance-to-price as compared to nodes without GPU 
(white fill) or with Tesla GPUs. Adding at least one consumer-grade GPU to a node increases its 
performance-to-price ratio by a factor of about two, as seen from the dotted lines in the figures that 
connect GPU nodes with their GPU-less counterparts. Nodes with HPC instead of consumer GPUs 
(e. g. Tesla K20X instead of GeForce GTX 980) are however more expensive and less productive 
with GROMACS (black dotted lines). 

Consumer PCs with an Intel Core processor and a GeForce GPU in the low-cost regime at 
around 1,000 € produce the largest amount of MD trajectory per money spent. However, these 
machines come in a desktop chassis and lack ECC memory. Even less expensive than the tested 
Core i7-4770K and i7-5830K CPUs would be a desktop equivalent of the E3-1270v2 system with 
i7-3770 processor, which would cost about 600 € without GPU, or a Haswell-based system, e. g. 
with i5-4460 or i5-4590, starting at less than 500 €. 

Over the lifetime of a compute cluster, the costs for electricity and cooling (C4) become a sub¬ 
stantial or even the dominating part of the total budget. Whether or not energy costs are accounted 
for therefore strongly influences what the optimal hardware will be for a fixed budget. Whereas the 
power draw of nodes with GPUs can be twice as high as without, their GROMACS performance 
is increased by an even larger factor. With energy costs included, configurations with balanced 
CPU/GPU resources produce the largest amount of MD trajectory over their lifetime (Tables and 
0 . 

Vendors giving warranty for densely packed nodes with consumer-class GPUs can still be 
difficult to find. If rack space is an issue (C5), it is possible to mount 2 x Intel E26xx v2/3 processors 
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plus up to four consumer GPUs in just 2U standard rack units. However, servers requiring less 
than 3 U that are able to host GeForce cards are rare and also more expensive than their 3-4 U 
counterparts. For Tesla GPUs however, there are supported and eertified solutions allowing for up 
to three GPUs and two CPUs in a 1 U ehassis. 

Small tweaks to reduee hardware eosts is aequiring just the minimal amount of RAM proposed 
by the vendor, which is normally more than enough for GROMACS. Also, chassis with redundant 
power supply adapters are more expensive but mostly unneeessary. If a node fails for any rea¬ 
son, the GROMACS built-in cheekpointing support ensures that by default at most 15 minutes of 
trajeetory produetion are lost and that the simulation ean easily be eontinued. 

For parallel simulations (Figure]^, the performanee-to-priee ratio mainly depends on the paral¬ 
lel efficieney that is aehieved. Nodes with consumer GPUs (e. g. E5-2670v2 -i- 2x780Ti) connected 
by QDR IB network have the highest performance-to-price ratios on up to about eight nodes (dark 
blue lines). The highest parallel performanee (or minimal time-to-solution, C3) for a single MD 
system is reeorded with the lowest lateney intereonneet. This however eomes at the eost of tra- 
jeetories that are 2 - 8 x as expensive as on the single nodes with the best performanee-to-priee 
ratio. 

Figure summarizes best praetiees helping to exploit the hardware’s potential with GRO¬ 
MACS. These rules of thumb for standard MD simulations with PME eleetrostaties and Verlet 
eutoff seheme hold for moderately parallel seenarios. When approaehing the sealing limit of ~ 
100 atoms per core, a more elaborate parameter sean will be useful to find the performanee op¬ 
timum. Unfavorable parallelization settings ean reduee performanee by a faetor of two even in 
single node runs. On single nodes with processors supporting HT, for the MD systems tested, 
exploiting all hardware threads showed the best performance. However, when sealing to higher 
node eounts using one thread per physieal eore gives better performanee. On nodes with Tesla 
GPUs, ehoosing the highest supported applieation cloek rate never hurts GROMACS performanee 
but will typieally mean inereased power eonsumption. Einally, even the eompiler ehoiee ean yield 
a 20% performanee difference with GCC >4.7 produeing the fastest binaries. 
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General hints concerning GROMACS performance 

□ Recent GCC compilers > 4.7 with best SIMD instruction set supported by the CPU 
produce the fastest binaries (Table[^ 

□ In single node runs with up to 12-16 cores, A^rank = 1 and A^th = Nq performs best, on 
Intel even across sockets 

□ On single nodes, thread-MPI often offers better performance than a regular MPI library 

□ Hyper-threading is beneficial with high enough atoms/core count (> 2500) and in 
OpenMP-only runs (A^rank = 1) 

□ Use A^rank X A^th = Nq and ensure that thread pinning works correctly (check the mdrun 
log output) 

□ Check parallel efficiency E on multiple nodes by comparing to the single-node per¬ 
formance (Tables [TT| - [T^ 

□ The time and cycle accounting table in the mdrun log file helps discovering perfor¬ 
mance issues 

Method and algorithmic tweaks^ 

□ Neighbor search frequency of 20-80 is recommended, > 50 values are typically 
useful in parallel GPU runs (Figure]^ 

□ In highly parallel simulations, overhead due to global communication is reduced with 
lower frequencies for energy calculation, temperature and pressure coupling 

□ Consider using h-bonds constraints with 2fs time step to reduce computational and 
communication cost 

□ Virtual sites allow 4-5 fs long time step 

□ Tweak LINCS settings at high parallelization 

□ Try increasing PME order to 5 (from the default 4) with a proportionally coarser grid^^ 

On CPU nodes 

□ With high core count, A^rank = A^c and Wth = 1 performs best and g_tune_pme conve¬ 
niently determines the optimal number of separate PME ranks. 

On GPU nodes 

□ Try if switching DLB off improves performance (Figure]^ 

□ With A^cpu > 2, using A^rank > A^cpu often improves performance (Figure]^. 

□ On Tesla K20, K40, K80, highest application clock rate can be used (Figure]^ 

□ Multi-simulation increases aggregate performance (Figure |5) 

□ The parallel performance across many nodes can be improved by using a homoge¬ 
neous, interleaved PME node separation (p. (15) 


Figure 8: GROMACS performance checklist. Number of MPI ranks, Arank; number of OpenMP 
threads, Nth', number of CPU cores, Nq. 
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For the researcher it does not matter from which hardware MD trajectories originate, but when 
having to purchase the hardware it makes a substantial difference. In all our tests, nodes with good 
consumer GPUs exhibit the same (or even higher) GROMACS performance as with HPC GPUs — 
at a fraction of the price. If one has a fixed budget, buying nodes with expensive HPC instead 
of cheap consumer GPUs means that the scientists will have to work with just half of the data 
they could have had. Consumer GPUs can be easily checked for memory integrity with available 
stress-testing tools and replaced if necessary. As consumer-oriented hardware is not geared toward 
non-stop use, repeating these checks from time to time helps catching failing GPU hardware early. 
Subject to these limitations, nodes with consumer-class GPUs are nowadays the most economic 
way to produce MD trajectories not only with GROMACS. The general conclusions concerning 
hardware competitiveness may also have relevance for several other MD codes like CHARMM,^ 
LAMMPS,®or NAMD,I^ which like GROMACS also use GPU acceleration in an offloading ap¬ 
proach. 
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